The Beatles have been a household name for decades. They are regarded by many as one of the greatest rock bands of all time, but what made them so popular? Perhaps some exploratory data analysis can provide an answer.

Let’s load the necessary packages first.

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(tm)
## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
library(wordcloud)
## Loading required package: RColorBrewer
library(tidytext)
library(ggplot2)
library(syuzhet)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose

Then we shall load our processed lyrics.

load("../output/processed_lyrics.RData")

artists <- read.csv("../data/artists.csv")

For this analysis, I’m focusing on the Beatles within the larger subgroup of rock artists.
I’m splitting the data into Beatles songs and non-Beatles songs (the latter from artists formed around the same time, between 1960 and 1970) for comparison and contrast.

artists_60_70s<- artists %>% 
  filter(Formed %in% c(1960:1970)) %>% 
  select(Artist)

beatles_lyrics <- dt_lyrics %>% 
  filter(artist =="beatles" & genre == "Rock")

other_lyrics <- dt_lyrics %>% 
  filter(artist %in% artists_60_70s$Artist & genre == "Rock" & artist != "beatles") 
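
As a quick illustration of what the split does, here is a minimal sketch on toy data. The `toy_artists` and `toy_lyrics` tables and their values are made up for this example; only the column names (`Artist`, `Formed`, `artist`, `genre`) mirror the ones used above. It is written with base subsetting so it runs without any packages.

```r
# Toy stand-ins for `artists` and `dt_lyrics` (all values are made up)
toy_artists <- data.frame(
  Artist = c("beatles", "band-a", "band-b"),
  Formed = c(1960, 1964, 1987),
  stringsAsFactors = FALSE
)
toy_lyrics <- data.frame(
  artist = c("beatles", "beatles", "band-a", "band-b", "band-a"),
  genre  = c("Rock", "Rock", "Rock", "Rock", "Pop"),
  stringsAsFactors = FALSE
)

# Artists formed in the 1960-1970 window
era_artists <- toy_artists$Artist[toy_artists$Formed %in% 1960:1970]

# Same conditions as the filter() calls above, in base R
toy_beatles <- toy_lyrics[toy_lyrics$artist == "beatles" &
                            toy_lyrics$genre == "Rock", ]
toy_other   <- toy_lyrics[toy_lyrics$artist %in% era_artists &
                            toy_lyrics$genre == "Rock" &
                            toy_lyrics$artist != "beatles", ]

# Sanity checks: group sizes, and no song lands in both groups
nrow(toy_beatles)   # 2
nrow(toy_other)     # 1
length(intersect(rownames(toy_beatles), rownames(toy_other)))   # 0
```

One thing worth checking on the real data is that the artist names in `artists.csv` are spelled the same way as in `dt_lyrics$artist`; if they are not, the `%in%` condition silently drops those artists.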

When comparing the number of stemmed words per song for each group, it seems that the Beatles have a lower average count (51) than similar rock bands (63). But what if the other bands’ distribution is skewed by outliers?

# the number of stemmed words in each song by group
beatles.words_per_song <- sort(sapply(strsplit(beatles_lyrics$stemmedwords, " "), length))
other.words_per_song <- sort(sapply(strsplit(other_lyrics$stemmedwords, " "), length))


# mean, median, and spread
summary(beatles.words_per_song)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   33.00   44.00   51.02   63.00  313.00
summary(other.words_per_song)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   40.00   57.00   62.77   78.00  575.00
# resulting plot
boxplot(beatles.words_per_song, other.words_per_song, horizontal = T, names = c("beatles", "other"), col = "blue")
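
To see why a few very long songs matter for the mean but not for the median, here is a small sketch with made-up word counts (the vectors `group_a` and `group_b` are hypothetical, not taken from the lyrics data).

```r
# Made-up word counts: the groups are identical except for one very long song
group_a <- c(30, 40, 45, 50, 60)
group_b <- c(30, 40, 45, 50, 600)

mean(group_a)    # 45
mean(group_b)    # 153, pulled up by the single long song
median(group_a)  # 45
median(group_b)  # 45, unaffected by the long song

# A trimmed mean drops the most extreme values on each side before averaging
mean(group_b, trim = 0.2)  # 45
```

In the summaries above the medians also differ (44 versus 57), which already hints that the gap between the two groups is not driven by outliers alone.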

Even after removing the high outliers, the Beatles still have a lower mean number of stemmed words per song (about 47 versus 59) and a lower median (42.5 versus 56). This suggests that Beatles songs have, on average, shorter lyrics.

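# finding outliers for each group (same 1.5 * IQR rule that boxplot() uses, without drawing a plot)
outliers1 <- boxplot.stats(beatles.words_per_song)$out

outliers2 <- boxplot.stats(other.words_per_song)$out

# remove outliers; logical indexing stays correct even when no outliers are found
# (x[-which(...)] would return an empty vector in that case)

beatles.words_per_song2 <- beatles.words_per_song[!(beatles.words_per_song %in% outliers1)]
other.words_per_song2 <- other.words_per_song[!(other.words_per_song %in% outliers2)]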
# mean, median, and spread
summary(beatles.words_per_song2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   33.00   42.50   47.38   60.00  107.00
summary(other.words_per_song2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   39.00   56.00   59.19   75.00  135.00
# resulting plot
boxplot(beatles.words_per_song2, other.words_per_song2, horizontal = T, names = c("Beatles", "Other"), xlab = "Number of words per song", col = "blue")
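
For reference, the outlier rule behind the boxplot can be reproduced by hand. The sketch below uses a made-up vector `x` and compares a manual calculation against `boxplot.stats()`, which is the base R function that `boxplot()` relies on for its statistics.

```r
x <- c(10, 20, 30, 40, 50, 60, 70, 300)  # made-up word counts

h <- fivenum(x)         # min, lower hinge, median, upper hinge, max
spread <- h[4] - h[2]   # hinge spread, close to the interquartile range

# Points more than 1.5 spreads beyond the hinges are flagged as outliers
lower_fence <- h[2] - 1.5 * spread
upper_fence <- h[4] + 1.5 * spread

manual_out  <- x[x < lower_fence | x > upper_fence]
builtin_out <- boxplot.stats(x)$out

manual_out                          # 300
identical(manual_out, builtin_out)  # TRUE
```

Note that this rule flags values on both sides of the box. In the word counts above, only the high side is likely to matter because the lower fence falls below zero.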

Next, I’ll convert the given stemmed words into a corpus and then a term document matrix, since the data was already processed and cleaned.

# Helper function to streamline the conversion of stemmed words to a corpus

stem_to_corpus <- function(x){
  source <- VectorSource(x)
  corpus <- VCorpus(source)
  return (corpus)
}

# respective corpora
beatles.corpus <- stem_to_corpus(beatles_lyrics$stemmedwords)
other.corpus <- stem_to_corpus(other_lyrics$stemmedwords)
# respective tdm
beatles.tdm <- TermDocumentMatrix(beatles.corpus)
other.tdm <- TermDocumentMatrix(other.corpus)

#beatles.tdm <- removeSparseTerms(beatles.tdm, 0.99) # remove lower frequency terms
#other.tdm <- removeSparseTerms(other.tdm, 0.99) # remove lower frequency terms

# Converting TDMs into data frames of terms, sorted from most to least frequent

# beatles
m.beatles <- as.matrix(beatles.tdm)
f.b <- sort(rowSums(m.beatles), decreasing=T)
beatles.word_freq<- data.frame(word= names(f.b), freq=f.b)

# other
m.other <- as.matrix(other.tdm)
f.other <- sort(rowSums(m.other), decreasing=T)
other.word_freq<- data.frame(word= names(f.other), freq=f.other)
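
The term counts that come out of the term document matrix can be cross-checked with a few lines of base R. This is a minimal sketch on made-up documents (`toy_docs` is hypothetical), not a replacement for the `tm` pipeline above.

```r
# Made-up "stemmed lyrics", one string per song
toy_docs <- c("love love girl", "love baby cry", "girl cry cry")

# Split every document on spaces and pool the tokens
tokens <- unlist(strsplit(toy_docs, " ", fixed = TRUE))

# Count each distinct token and sort from most to least frequent
counts <- sort(table(tokens), decreasing = TRUE)

toy_word_freq <- data.frame(
  word = names(counts),
  freq = as.vector(counts),
  stringsAsFactors = FALSE
)
toy_word_freq
```

One difference to keep in mind: by default `TermDocumentMatrix()` discards terms shorter than three characters (its `wordLengths` control defaults to `c(3, Inf)`), so totals from this base R count can be slightly higher than the `tm` totals on real lyrics.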

Here are the respective word clouds.

# wordcloud for beatles
set.seed(1)
wordcloud(words = beatles.word_freq$word, freq =beatles.word_freq$freq, max.words = 100, random.order=FALSE,
          colors=brewer.pal(8, "Dark2"))

# wordcloud for other
set.seed(1)
wordcloud(words = other.word_freq$word, freq =other.word_freq$freq, max.words = 100, random.order=FALSE,  
          colors=brewer.pal(8, "Dark2"))

Even without looking at any quantitative summaries, it’s clear that “love” is the dominant word for both the Beatles and the other rock groups.

# top 20 terms
head(beatles.word_freq, 20)
head(other.word_freq, 20)

Based on the top twenty words in the respective dataframes, it seems that there is little difference between the most used words from Beatles lyrics and those of other rock groups of the sixties/seventies.

For the most part, the top words are the same in both groups and sit at similar ranks; one exception is “cry” (possibly signalling a negative sentiment), which is the 10th most used word in the Beatles lyrics but only the 20th in other bands’ lyrics.

The relative frequencies in the bar plots, table, boxplots and pie charts indicate that there is some difference in the distribution of the Beatles’ words. For example, in terms of raw frequency, “love” is used 976 times in Beatles lyrics and 6432 times in other bands’ lyrics; however, when looking a

# summation of all frequencies for each group
n1<-sum(beatles.word_freq$freq)
n2<-sum(other.word_freq$freq) 

# relative frequencies  for each group
beatles.rf20 <- round(head(beatles.word_freq$freq, 20)/n1, 3)
other.rf20 <- round(head(other.word_freq$freq, 20)/n2, 3)

# relative frequency table (built column by column so that freq and r-freq stay numeric;
# wrapping the columns in cbind() would coerce everything to character)
data.table("Beatle words" = as.character(head(beatles.word_freq$word, 20)),
           "freq" = head(beatles.word_freq$freq, 20),
           "r-freq" = head(beatles.word_freq$freq, 20)/n1)
data.table("Other words" = as.character(head(other.word_freq$word, 20)),
           "freq" = head(other.word_freq$freq, 20),
           "r-freq" = head(other.word_freq$freq, 20)/n2)
# barplots of the relative frequencies of the top 20 words
barplot(beatles.rf20, las = 1,names.arg = head(beatles.word_freq$word, 20), col = rainbow(20), horiz = T )

barplot(other.rf20, las = 2,names.arg = head(other.word_freq$word, 20), col = rainbow(20),horiz = T )
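
Because the two groups differ so much in size, raw counts and relative frequencies can point in opposite directions. The sketch below shows this on made-up counts (`toy_a` and `toy_b` are hypothetical and unrelated to the real totals `n1` and `n2`).

```r
# Made-up term counts for a small group and a large group
toy_a <- c(love = 100, girl = 50,  cry = 50)    # 200 words in total
toy_b <- c(love = 600, girl = 900, cry = 1500)  # 3000 words in total

# Relative frequency = count of the term / total count of all terms
rel_a <- toy_a / sum(toy_a)
rel_b <- toy_b / sum(toy_b)

toy_a["love"] < toy_b["love"]  # TRUE: the larger group uses "love" more often in raw terms
rel_a["love"]                  # 0.5
rel_b["love"]                  # 0.2, a smaller share of that group's words
```

This is the reason the comparison above divides by the sum of all term frequencies in each group rather than comparing counts directly.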

Perhaps a more direct comparison will help.

# save the top 20 words of each group as character strings

top.beatles <- as.character(head(beatles.word_freq$word, 20))
top.other <- as.character(head(other.word_freq$word, 20))

# words that appear in both top-20 lists
v <- intersect(top.beatles, top.other) # 14 words

# keep only the shared words in each group's frequency table
o <- other.word_freq %>%
  filter(word %in% v)

b <- beatles.word_freq %>%
  filter(word %in% v)

# merge frequencies by word
top14 <- merge(b, o, by = "word")
top14 <- top14 %>%
  rename(
    beatles = freq.x,
    other = freq.y
  )
# relative frequency, one pair of bars per shared word (red = Beatles, blue = other)
barplot(rbind(top14$beatles/n1, top14$other/n2), beside = TRUE,
        names.arg = as.character(top14$word), las = 2,
        col = c("red", "blue"), legend.text = c("Beatles", "Other"))

SENTIMENT ANALYSIS

Now, let’s take a look at the sentiments of the different groups’ lyrics.

I’m using the syuzhet package (with the NRC lexicon) on the stemmed words, plotting histograms to see the general shape of each distribution and a boxplot to compare the spread.

Though the sample sizes are vastly different, the general shapes of the distributions are similar and more or less concentrated around 0. According to the boxplot, the sentiment scores of the other bands do seem to have a wider spread, indicating more variation in emotional extremes.

beatles.sentiments.nrc <- get_sentiment(beatles_lyrics$stemmedwords, method = "nrc", language = "english")
other.sentiments.nrc <- get_sentiment(other_lyrics$stemmedwords, method = "nrc", language = "english")

hist(beatles.sentiments.nrc, xlab = "Emotion index (nrc)", main = "Histogram of Beatles Sentiments", col = "blue")

hist(other.sentiments.nrc, xlab = "Emotion index (nrc)", main = "Histogram of Other Sentiments", col = "red")

boxplot(beatles.sentiments.nrc, other.sentiments.nrc, horizontal = T, names = c("beatles", "other"), col = c("blue", "red"), xlab = "Emotion index (nrc)")
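
To make the "emotion index" less of a black box, here is a minimal sketch of how a lexicon-based score can be computed: each word found in the lexicon contributes its value, and the score of a song is the sum. The `toy_lexicon` and `score_text()` below are hypothetical and written in base R; they illustrate the general idea and are not the actual NRC lexicon or the internals of `get_sentiment()`.

```r
# Hypothetical mini-lexicon: +1 for positive words, -1 for negative words
toy_lexicon <- c(love = 1, happy = 1, cry = -1, lonely = -1)

# Sum the lexicon values of every word in the text that the lexicon knows
score_text <- function(text, lexicon) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  hits <- lexicon[words[words %in% names(lexicon)]]
  sum(hits)
}

toy_songs <- c("love love happy", "cry lonely night", "love cry")
toy_scores <- vapply(toy_songs, score_text, numeric(1),
                     lexicon = toy_lexicon, USE.NAMES = FALSE)
toy_scores  # 3 -2 0
```

Under a scheme like this, longer texts can accumulate larger absolute scores simply because they contain more words. Since the other bands' songs are longer on average, part of their wider spread may reflect length; dividing each score by the song's word count would be one way to check.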

Using the

plot(beatles.sentiments.nrc, type = "h")

plot(other.sentiments.nrc, type = "h")

Even a pie chart does not indicate a notable difference between the Beatles and other bands of their time, though the Beatles are slightly more positive.

# emotion vector
emotions <- c("anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust", "negative", "positive")
# set sentiments to dataframe
beatles.sentiments.pie  <- data.frame(sentiments = colSums(get_nrc_sentiment(beatles_lyrics$stemmedwords)))
# this one may take a minute
other.sentiments.pie  <- data.frame(sentiments = colSums(get_nrc_sentiment(other_lyrics$stemmedwords)))
# set emotions to dataframes of sentiments
beatles.sentiments.pie$emotions <- emotions
other.sentiments.pie$emotions <- emotions
# print out the sentiments for direct comparison
beatles.sentiments.pie
other.sentiments.pie
# piechart 
plot_ly() %>%
  add_pie(data = beatles.sentiments.pie, labels =  emotions, values = ~sentiments,
          name = "Beatles", domain = list(x = c(0, 0.4), y = c(0.4, 1))) %>%
  add_pie(data = other.sentiments.pie, labels =  emotions, values = ~sentiments,
          name = "Other", domain = list(x = c(0.6, 1), y = c(0.4, 1))) %>%
  layout(title = "Sentiments Pie Charts", showlegend = T,
         xaxis = list(showgrid = F, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = F, zeroline = FALSE, showticklabels = FALSE))
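
Since a pie chart shows shares rather than counts, the two groups can also be compared numerically by converting the emotion totals to proportions. The sketch below uses made-up totals (`toy_counts_a` and `toy_counts_b` are hypothetical) and base R's `prop.table()`.

```r
# Made-up emotion totals for a small group and a large group
toy_counts_a <- c(joy = 30,  sadness = 10,  trust = 20)   # 60 in total
toy_counts_b <- c(joy = 200, sadness = 100, trust = 100)  # 400 in total

# Convert counts to shares so that group size no longer matters
share_a <- prop.table(toy_counts_a)
share_b <- prop.table(toy_counts_b)

# Difference in share per emotion (positive means a larger share in group a)
round(share_a - share_b, 3)
# joy 0.000, sadness -0.083, trust 0.083
```

Applied to `beatles.sentiments.pie$sentiments` and `other.sentiments.pie$sentiments`, the same calculation would put a number on how much more positive one group is than the other.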

Conclusion: